COVID-19 outbreak analysis in Germany with R

https://github.com/rozgicm/CovidAnalysis

The conducted analysis follows the posts by Tim Churches, starting with his post in February 2020: https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/#estimating-changes-in-the-effective-reproduction-number

The code is slightly modified and some graphs are tweaked. All in all, this is meant to help scientists and non-scientists alike gain insights and conduct their own analysis of the situation. The main code is found in covidAnalysis.R. I will try to show some results here on the main page. Please feel free to contribute.

Disclaimer

I am not a medical doctor; I am only a data dude who wants to help citizen data scientists stay informed and check the numbers we are confronted with every day. If, on the other hand, you are someone who understands more about epidemiology, feel free to use anything you find here. Remember, most of it is presented in a much better way by Tim Churches. If you are interested in doing some data analysis yourself and want to contribute, please feel free; that's what GitHub is for.

Data Acquisition

Data are pulled from the JHU GitHub archive at https://raw.githubusercontent.com/CSSEGISandData/. R makes it fairly easy to fetch the data directly from this archive, and the JHU data are nicely formatted and thus well suited for a quick analysis.
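The code below assumes the following packages are attached. This list is inferred from the functions used throughout the script and is an assumption on my part, not part of the original source:

```r
# Packages used in the snippets below (inferred from the function calls)
library(tidyverse)  # read_csv(), dplyr verbs, tidyr::pivot_longer(), ggplot2
library(lubridate)  # mdy(), week(), days()
library(glue)       # glue() string interpolation in titles
library(plotly)     # ggplotly() for interactive figures
library(see)        # theme_lucid(), scale_colour_flat_d()
library(ggforce)    # facet_zoom()
```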

jhuUrl = paste("https://raw.githubusercontent.com/CSSEGISandData/",
                 "COVID-19/master/csse_covid_19_data/", "csse_covid_19_time_series/",
                 "time_series_covid19_confirmed_global.csv", sep = "")

dat =  read_csv(jhuUrl) %>% rename(province = "Province/State",
                                    country = "Country/Region")

head(dat)
## # A tibble: 6 x 79
##   province country   Lat   Long `1/22/20` `1/23/20` `1/24/20` `1/25/20`
##   <chr>    <chr>   <dbl>  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 <NA>     Afghan…  33    65            0         0         0         0
## 2 <NA>     Albania  41.2  20.2          0         0         0         0
## 3 <NA>     Algeria  28.0   1.66         0         0         0         0
## 4 <NA>     Andorra  42.5   1.52         0         0         0         0
## 5 <NA>     Angola  -11.2  17.9          0         0         0         0
## 6 <NA>     Antigu…  17.1 -61.8          0         0         0         0
## # … with 71 more variables: `1/26/20` <dbl>, `1/27/20` <dbl>, `1/28/20` <dbl>,
## #   `1/29/20` <dbl>, `1/30/20` <dbl>, `1/31/20` <dbl>, `2/1/20` <dbl>,
## #   `2/2/20` <dbl>, `2/3/20` <dbl>, `2/4/20` <dbl>, `2/5/20` <dbl>,
## #   `2/6/20` <dbl>, `2/7/20` <dbl>, `2/8/20` <dbl>, `2/9/20` <dbl>,
## #   `2/10/20` <dbl>, `2/11/20` <dbl>, `2/12/20` <dbl>, `2/13/20` <dbl>,
## #   `2/14/20` <dbl>, `2/15/20` <dbl>, `2/16/20` <dbl>, `2/17/20` <dbl>,
## #   `2/18/20` <dbl>, `2/19/20` <dbl>, `2/20/20` <dbl>, `2/21/20` <dbl>,
## #   `2/22/20` <dbl>, `2/23/20` <dbl>, `2/24/20` <dbl>, `2/25/20` <dbl>,
## #   `2/26/20` <dbl>, `2/27/20` <dbl>, `2/28/20` <dbl>, `2/29/20` <dbl>,
## #   `3/1/20` <dbl>, `3/2/20` <dbl>, `3/3/20` <dbl>, `3/4/20` <dbl>,
## #   `3/5/20` <dbl>, `3/6/20` <dbl>, `3/7/20` <dbl>, `3/8/20` <dbl>,
## #   `3/9/20` <dbl>, `3/10/20` <dbl>, `3/11/20` <dbl>, `3/12/20` <dbl>,
## #   `3/13/20` <dbl>, `3/14/20` <dbl>, `3/15/20` <dbl>, `3/16/20` <dbl>,
## #   `3/17/20` <dbl>, `3/18/20` <dbl>, `3/19/20` <dbl>, `3/20/20` <dbl>,
## #   `3/21/20` <dbl>, `3/22/20` <dbl>, `3/23/20` <dbl>, `3/24/20` <dbl>,
## #   `3/25/20` <dbl>, `3/26/20` <dbl>, `3/27/20` <dbl>, `3/28/20` <dbl>,
## #   `3/29/20` <dbl>, `3/30/20` <dbl>, `3/31/20` <dbl>, `4/1/20` <dbl>,
## #   `4/2/20` <dbl>, `4/3/20` <dbl>, `4/4/20` <dbl>, `4/5/20` <dbl>

As can be seen, the data are in 'wide' format. I will be using the tidyverse packages a lot (and I mean a lot!), so I will reshape the data into 'long' format:

datLong = dat %>%
  filter(country == "Germany") %>%  
  pivot_longer(-c(province,country, Lat, Long),
               names_to = "Date",
               values_to = "cumulative_cases")
head(datLong)
## # A tibble: 6 x 6
##   province country   Lat  Long Date    cumulative_cases
##   <chr>    <chr>   <dbl> <dbl> <chr>              <dbl>
## 1 <NA>     Germany    51     9 1/22/20                0
## 2 <NA>     Germany    51     9 1/23/20                0
## 3 <NA>     Germany    51     9 1/24/20                0
## 4 <NA>     Germany    51     9 1/25/20                0
## 5 <NA>     Germany    51     9 1/26/20                0
## 6 <NA>     Germany    51     9 1/27/20                1

To get a feeling for how the number of infected people is evolving, we can quickly visualize the cumulative case counts. Since exponential growth is expected, it also makes sense to plot the log-transformed number of cases.

datLong = datLong %>%
  select(-c(province, Lat, Long)) %>%
  mutate(Date = mdy(Date)) %>%
  filter(cumulative_cases != 0) %>%                       # keep days with at least one case
  mutate(incident_cases = c(0, diff(cumulative_cases)),   # daily new cases
         myDay = 1:n(),                                   # running day index
         myWeek = week(Date) - 3)

p1 = datLong %>% ggplot(aes(x=Date, y = cumulative_cases)) +
  geom_line() +
  geom_point(shape = "x") +
  labs(title = glue("Number of COVID-19 infections in {datLong$country[1]}"),
       subtitle = glue("data ranging from {min(datLong$Date)} to {max(datLong$Date)}")
       ) +
  theme_lucid()

p2 = datLong %>% ggplot(aes(x=Date, y = log(cumulative_cases))) +
  geom_line() +
  geom_point(shape = "x") +
  labs(title = glue("Number of COVID-19 infections in {datLong$country[1]}"),
       subtitle = glue("data ranging from {min(datLong$Date)} to {max(datLong$Date)}"),
       caption = "log scale plot"
       ) +
  theme_lucid()

p1 %>% ggplotly(., dynamicTicks = TRUE)
p2 %>% ggplotly(., dynamicTicks = TRUE)

Some people are interested in the growth of cases after, for example, the 100th case. Usually I see these figures plotted with absolute numbers, which I find somewhat misleading. Using the wppExplorer package, I was able to obtain the total population of each European country.

datWorld =  dat %>% pivot_longer(-c(province,country, Lat, Long),
                          names_to = "Date", 
                          values_to = "cumulative_cases")
europe = read_csv2(file =paste0(pathData, "/countriesOfEurope.csv"))

# consider "Mainland" only for the time being
datEurope = filter(datWorld, country %in% europe$Countries)

datEurope = datEurope %>% 
  mutate(Date = mdy(Date)) %>% 
  group_by(country, Date) %>% 
    summarise(sumCumCases = sum(cumulative_cases)) %>% 
  ungroup() 


tPop = readRDS(file = paste0(pathData, "/tPop.RDS"))
cCodes = readRDS(file = paste0(pathData, "/countryCodes.RDS"))

datEurope = datEurope %>% 
  left_join(., cCodes %>%  select(name, charcode), 
            by = c("country"="name")) %>% 
  left_join(., tPop %>% 
              filter(Year =="2020") %>%
              select(-Year),
            by ="charcode") %>% 
  mutate(value = value * 1000)


datEurope = datEurope %>%
  filter(sumCumCases > 100) %>%
  group_by(country) %>%
  mutate(daySince100 = row_number()) %>%   # day index since crossing 100 cases
  ungroup()


(datEurope %>%
  ggplot(aes(x = daySince100, y = (sumCumCases / value) * 100,
             colour = country, group = country)) +
  geom_line() +
  labs(x = "Number of days after reaching 100 cases",
       y = "Percentage of population per country",
       title = "Development of cases in European countries after reaching 100 infections") +
  theme_lucid() +
  scale_colour_flat_d()) %>% ggplotly()

Implemented Features

As of this day (2020-04-06), only simple models are considered. That means no modelling of the countermeasures that have been undertaken is included. So-called 'physical distancing' is not accounted for in any model yet. I hope I will get to it, or maybe someone can include it ;-).

Linear modelling

When assuming exponential growth, it is a good idea to fit a linear model to the log-transformed data. The figure above suggests that it is sensible to start the modelling somewhere in the middle or end of February.

myLinearModel = lm(log(cumulative_cases) ~ myDay,
                   datLong %>%
                   filter(Date >= as.Date("2020-02-24"))
                   )
summary(myLinearModel)
## 
## Call:
## lm(formula = log(cumulative_cases) ~ myDay, data = datLong %>% 
##     filter(Date >= as.Date("2020-02-24")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1916 -0.4199  0.2020  0.4268  0.7128 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.70632    0.36848  -7.344 6.24e-09 ***
## myDay        0.22017    0.00723  30.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.568 on 40 degrees of freedom
## Multiple R-squared:  0.9586, Adjusted R-squared:  0.9576 
## F-statistic: 927.2 on 1 and 40 DF,  p-value: < 2.2e-16
linModelDf = broom::tidy(myLinearModel)
datLong %>% filter(Date >= as.Date("2020-02-24") ) %>%
  ggplot(aes(x=myDay, y= log(cumulative_cases))) +
  geom_smooth(method = lm) +
  geom_point() +
  labs(title = glue("fitted linear model with intercept {round(linModelDf$estimate[1],2)} and slope {round(linModelDf$estimate[2],2)}"))+
  theme_lucid()

The fitted slope can be used to estimate a daily growth factor (as far as I understood, this is related to \(R_{0}\) in epidemiology): since \(\log(\text{cases})\) grows linearly with slope \(b\), the case count is multiplied by \(e^{b}\) each day. From the above linear model it follows that

\(R^{\text{linear}}_0=\) 1.25
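In code, this value is just the exponential of the fitted slope (the slope value is copied from the model summary above):

```r
slope <- 0.22017          # fitted slope of log(cumulative_cases) ~ myDay, from summary above
r0Linear <- exp(slope)    # daily multiplicative growth factor
round(r0Linear, 2)
## [1] 1.25
```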

Further, we can compute quantities like the doubling time via \[t_{\text{double}} = \frac{\operatorname{log}\left ( 2 \right ) }{\operatorname{log} \left ( R^{\text{linear}}_0 \right) }.\] Note that \(\operatorname{log}(R^{\text{linear}}_0)\) is simply the fitted slope, which is why the code below divides by it directly.

log(2)/linModelDf$estimate[2]
## [1] 3.148193

And of course we can also make predictions. Please be careful, these are only predictions from a linear model!

datFit = datLong %>% filter(Date >= as.Date("2020-02-24"))

startDay  = min(datFit$myDay)
endDay    = max(datFit$myDay)
startDate = min(datFit$Date)
endDate   = max(datFit$Date)

predDF = broom::tidy(predict(myLinearModel,
                             newdata = data.frame(myDay = startDay:(endDay+7)),
                             interval = "prediction"))

predDF$Date = seq(startDate,
                  max(datLong$Date)+days(7),
                  by = "day")


p4 = p1 + geom_ribbon(data = predDF,
                      inherit.aes = FALSE,
                      aes(ymin = exp(lwr), ymax = exp(upr), x = Date),
                      fill = "grey2", alpha = 0.25) +
  geom_line(data = predDF, inherit.aes = FALSE,
            aes(x = Date, y = exp(fit), colour = "fit")) +
  annotate("text",
           y = c(max(exp(predDF$fit)), max(exp(predDF$lwr)), max(exp(predDF$upr))),
           x = max(predDF$Date) + days(2),
           label = c(glue("{round(max(exp(predDF$fit)))}"),
                     glue("{round(max(exp(predDF$lwr)))}"),
                     glue("{round(max(exp(predDF$upr)))}"))
           ) +
  labs(title = glue("Number of COVID-19 infections in {datLong$country[1]} with a 7 day fit")) +
  theme(legend.title = element_blank())

p4 + facet_zoom(xlim= c(endDate, max(predDF$Date)))

SIR modelling

A more sophisticated way to model the outbreak is the SIR model (Susceptible, Infectious, Recovered), which is based on a system of ordinary differential equations (ODEs). See the code and Tim Churches' posts for more details. However, the following can be obtained:
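As a rough illustration of what such an ODE system looks like, here is a minimal base-R sketch of the SIR dynamics integrated with a simple Euler scheme. The parameter values below are made up for illustration only; they are not fitted to the German data and are not the values used in covidAnalysis.R:

```r
# SIR dynamics: dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
# sir_euler() is a hypothetical helper, not part of the repo's code.
sir_euler <- function(beta, gamma, N, I0, days, dt = 0.1) {
  steps <- round(days / dt)
  S <- N - I0; I <- I0; R <- 0
  out <- matrix(NA_real_, nrow = steps + 1, ncol = 3,
                dimnames = list(NULL, c("S", "I", "R")))
  out[1, ] <- c(S, I, R)
  for (k in seq_len(steps)) {
    newInf <- beta * S * I / N * dt   # new infections in this time step
    newRec <- gamma * I * dt          # new recoveries in this time step
    S <- S - newInf
    I <- I + newInf - newRec
    R <- R + newRec
    out[k + 1, ] <- c(S, I, R)
  }
  as.data.frame(out)
}

# Illustrative run: beta/gamma = 2.5, German population, 100 initial cases
sim <- sir_euler(beta = 0.5, gamma = 0.2, N = 83e6, I0 = 100, days = 150)
```

In a real analysis, beta and gamma would be fitted to the observed incidence data rather than chosen by hand; the main script solves the ODE system properly instead of using this crude Euler stepping.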